Unsupervised Learning: Trade&Ahead¶

Marks: 60

Context¶

The stock market has consistently proven to be a good avenue for investing and saving for the future. There are many compelling reasons to invest in stocks: they can help fight inflation, build wealth, and provide some tax benefits. Good, steady returns on investments over a long period can grow far beyond what seems possible, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can accumulate for retirement. Overall, investing in stocks can help meet life's financial aspirations.

It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. A diversified portfolio tends to yield higher returns and carry lower risk by tempering potential losses when the market is down. It is easy to get lost in the sea of financial metrics to analyze when determining the worth of a stock, and doing so for a multitude of stocks to identify the right picks can be tedious. Cluster analysis can identify stocks that exhibit similar characteristics, as well as stocks with minimal correlation to one another. This helps investors analyze stocks across different market segments and protect the portfolio against risks that could make it vulnerable to losses.
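The risk-tempering effect of low correlation can be made concrete with the standard two-asset portfolio variance formula. The sketch below uses made-up weights and volatilities (none of these numbers come from the dataset):

```python
import math

# Hypothetical 50/50 two-stock portfolio; both stocks have 20% volatility
w1 = w2 = 0.5
sigma1 = sigma2 = 0.20

def portfolio_vol(rho):
    """Volatility of a two-asset portfolio with correlation rho."""
    var = (w1 * sigma1) ** 2 + (w2 * sigma2) ** 2 + 2 * w1 * w2 * rho * sigma1 * sigma2
    return math.sqrt(var)

# a highly correlated pair barely reduces risk; a weakly correlated one does
print(f"rho=0.9 -> portfolio volatility {portfolio_vol(0.9):.1%}")
print(f"rho=0.1 -> portfolio volatility {portfolio_vol(0.1):.1%}")
```

Lowering the correlation shrinks the cross term in the variance, which is exactly why grouping stocks and picking from weakly correlated clusters helps temper losses.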

Objective¶

Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.

Data Dictionary¶

  • Ticker Symbol: An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
  • Company: Name of the company
  • GICS Sector: The specific economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • GICS Sub Industry: The specific sub-industry group assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • Current Price: Current stock price in dollars
  • Price Change: Percentage change in the stock price over the past 13 weeks
  • Volatility: Standard deviation of the stock price over the past 13 weeks
  • ROE: A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is equal to a company's assets minus its debt)
  • Cash Ratio: The ratio of a company's total reserves of cash and cash equivalents to its total current liabilities
  • Net Cash Flow: The difference between a company's cash inflows and outflows (in dollars)
  • Net Income: Revenues minus expenses, interest, and taxes (in dollars)
  • Earnings Per Share: Company's net profit divided by the number of common shares it has outstanding (in dollars)
  • Estimated Shares Outstanding: Estimated number of shares of the company's stock currently held by all its shareholders
  • P/E Ratio: Ratio of the company's current stock price to the earnings per share
  • P/B Ratio: Ratio of the company's stock price per share by its book value per share (book value of a company is the net difference between that company's total assets and total liabilities)
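Several of these indicators are simple ratios of one another's components. As a quick illustration (all input figures below are hypothetical, not taken from the dataset):

```python
# Hypothetical figures for a single company (made up for demonstration only)
net_income = 3_669_000_000            # dollars
shareholders_equity = 44_000_000_000  # total assets - total liabilities (book value)
shares_outstanding = 2_800_000_000
stock_price = 104.66

roe = net_income / shareholders_equity                               # Return on Equity
eps = net_income / shares_outstanding                                # Earnings Per Share
pe_ratio = stock_price / eps                                         # P/E Ratio
pb_ratio = stock_price / (shareholders_equity / shares_outstanding)  # P/B Ratio

print(f"ROE: {roe:.2%}  EPS: ${eps:.2f}  P/E: {pe_ratio:.1f}  P/B: {pb_ratio:.2f}")
```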

Importing necessary libraries and data¶

In [8]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)

# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import pdist, cdist

# to perform k-means clustering, compute metric
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

#!pip install yellowbrick
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# to perform PCA
from sklearn.decomposition import PCA

# to disable warnings
import warnings
warnings.filterwarnings('ignore')
In [9]:
# loading the dataset
data = pd.read_csv("stock_data.csv")

Data Overview¶

  • Observations
  • Sanity checks
In [10]:
data.shape
Out[10]:
(340, 15)
  • The dataset has 340 rows and 15 columns
In [11]:
# viewing a random sample of the dataset
data.sample(n=10, random_state=1)
Out[11]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
102 DVN Devon Energy Corp. Energy Oil & Gas Exploration & Production 32.000000 -15.478079 2.923698 205 70 830000000 -14454000000 -35.55 4.065823e+08 93.089287 1.785616
125 FB Facebook Information Technology Internet Software & Services 104.660004 16.224320 1.320606 8 958 592000000 3669000000 1.31 2.800763e+09 79.893133 5.884467
11 AIV Apartment Investment & Mgmt Real Estate REITs 40.029999 7.578608 1.163334 15 47 21818000 248710000 1.52 1.636250e+08 26.335526 -1.269332
248 PG Procter & Gamble Consumer Staples Personal Products 79.410004 10.660538 0.806056 17 129 160383000 636056000 3.28 4.913916e+08 24.070121 -2.256747
238 OXY Occidental Petroleum Energy Oil & Gas Exploration & Production 67.610001 0.865287 1.589520 32 64 -588000000 -7829000000 -10.23 7.652981e+08 93.089287 3.345102
336 YUM Yum! Brands Inc Consumer Discretionary Restaurants 52.516175 -8.698917 1.478877 142 27 159000000 1293000000 2.97 4.353535e+08 17.682214 -3.838260
112 EQT EQT Corporation Energy Oil & Gas Exploration & Production 52.130001 -21.253771 2.364883 2 201 523803000 85171000 0.56 1.520911e+08 93.089287 9.567952
147 HAL Halliburton Co. Energy Oil & Gas Equipment & Services 34.040001 -5.101751 1.966062 4 189 7786000000 -671000000 -0.79 8.493671e+08 93.089287 17.345857
89 DFS Discover Financial Services Financials Consumer Finance 53.619999 3.653584 1.159897 20 99 2288000000 2297000000 5.14 4.468872e+08 10.431906 -0.375934
173 IVZ Invesco Ltd. Financials Asset Management & Custody Banks 33.480000 7.067477 1.580839 12 67 412000000 968100000 2.26 4.283628e+08 14.814159 4.218620
In [12]:
# copying the data to another variable to avoid any changes to original data
df = data.copy()
In [13]:
# checking datatypes and number of non-null values for each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticker Symbol                 340 non-null    object 
 1   Security                      340 non-null    object 
 2   GICS Sector                   340 non-null    object 
 3   GICS Sub Industry             340 non-null    object 
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64  
 8   Cash Ratio                    340 non-null    int64  
 9   Net Cash Flow                 340 non-null    int64  
 10  Net Income                    340 non-null    int64  
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB
  • Ticker Symbol, Security, GICS Sector & GICS Sub Industry are of object type & rest are either float or integer type
  • There are no NULL values in the dataset
In [14]:
# Ticker_Symbol is an abbreviation to identify the stocks, and is not required for further analysis
# Dropping Ticker Symbol
df.drop("Ticker Symbol", axis=1, inplace=True)
In [15]:
# We should convert the object type columns to categories, which reduces the memory required to store the dataframe.
# convert all columns with dtype object into category
for col in df.columns[df.dtypes=='object']:
    df[col] = df[col].astype('category')
In [16]:
# confirm conversions and dropped variable
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   Security                      340 non-null    category
 1   GICS Sector                   340 non-null    category
 2   GICS Sub Industry             340 non-null    category
 3   Current Price                 340 non-null    float64 
 4   Price Change                  340 non-null    float64 
 5   Volatility                    340 non-null    float64 
 6   ROE                           340 non-null    int64   
 7   Cash Ratio                    340 non-null    int64   
 8   Net Cash Flow                 340 non-null    int64   
 9   Net Income                    340 non-null    int64   
 10  Earnings Per Share            340 non-null    float64 
 11  Estimated Shares Outstanding  340 non-null    float64 
 12  P/E Ratio                     340 non-null    float64 
 13  P/B Ratio                     340 non-null    float64 
dtypes: category(3), float64(7), int64(4)
memory usage: 46.7 KB
In [17]:
# fixing column names
df.columns = [c.replace(" ", "_") for c in df.columns]
In [18]:
# let's check for duplicate observations
df.duplicated().sum()
Out[18]:
0
  • There are no duplicate records in the dataset
In [19]:
# let's check total null values
df.isnull().sum().sum()
Out[19]:
0
  • There are no null or missing values in the dataset
In [20]:
# Let's look at the statistical summary of the data
df.describe(include="all").T
Out[20]:
count unique top freq mean std min 25% 50% 75% max
Security 340 340 3M Company 1 NaN NaN NaN NaN NaN NaN NaN
GICS_Sector 340 11 Industrials 53 NaN NaN NaN NaN NaN NaN NaN
GICS_Sub_Industry 340 104 Oil & Gas Exploration & Production 16 NaN NaN NaN NaN NaN NaN NaN
Current_Price 340.0 NaN NaN NaN 80.862345 98.055086 4.5 38.555 59.705 92.880001 1274.949951
Price_Change 340.0 NaN NaN NaN 4.078194 12.006338 -47.129693 -0.939484 4.819505 10.695493 55.051683
Volatility 340.0 NaN NaN NaN 1.525976 0.591798 0.733163 1.134878 1.385593 1.695549 4.580042
ROE 340.0 NaN NaN NaN 39.597059 96.547538 1.0 9.75 15.0 27.0 917.0
Cash_Ratio 340.0 NaN NaN NaN 70.023529 90.421331 0.0 18.0 47.0 99.0 958.0
Net_Cash_Flow 340.0 NaN NaN NaN 55537620.588235 1946365312.175789 -11208000000.0 -193906500.0 2098000.0 169810750.0 20764000000.0
Net_Income 340.0 NaN NaN NaN 1494384602.941176 3940150279.327936 -23528000000.0 352301250.0 707336000.0 1899000000.0 24442000000.0
Earnings_Per_Share 340.0 NaN NaN NaN 2.776662 6.587779 -61.2 1.5575 2.895 4.62 50.09
Estimated_Shares_Outstanding 340.0 NaN NaN NaN 577028337.75403 845849595.417695 27672156.86 158848216.1 309675137.8 573117457.325 6159292035.0
P/E_Ratio 340.0 NaN NaN NaN 32.612563 44.348731 2.935451 15.044653 20.819876 31.764755 528.039074
P/B_Ratio 340.0 NaN NaN NaN -1.718249 13.966912 -76.119077 -4.352056 -1.06717 3.917066 129.064585
  • No security/company appears more than once; all 340 values are unique
  • There are 11 unique GICS_Sector values, of which Industrials is the most common
  • There are 104 unique GICS_Sub_Industry values, of which Oil & Gas Exploration & Production is the most common
  • The average Current_Price is $80.86 while the median is $59.71, indicating the data is right skewed with some high outliers
  • The average Price_Change (% change in price over the last 13 weeks) and average Volatility (standard deviation of price over the last 13 weeks) are 4.08% and 1.53, while the median values are 4.82% and 1.39 respectively
  • The average ROE (return on equity) is 39.60 while the median is only 15, indicating right skew with some outliers. A higher ROE indicates that a company can generate higher profits without needing much capital, and is preferred by investors
  • Cash ratio measures a company's ability to pay off short-term liabilities with cash and cash equivalents. While a high cash ratio is preferred by investors, a very high cash ratio may also indicate that cash is not being reinvested or earning further returns. The average is 70.02 and the median is 47
  • The average Net_Cash_Flow is about $55.5 million while the median is about $2.1 million (right skewed, with high outliers)
  • The average Net_Income is about $1.49 billion while the median is about $707 million (again right skewed, with high outliers). Net cash flow is the amount of money that actually comes into and goes out of a business during a period; net income is the profit or loss remaining after subtracting all expenses from total revenue. Higher values of both are preferred
  • The average Earnings_Per_Share (net profit / shares outstanding) is 2.78 while the median is 2.90 (slightly left skewed)
  • The average Estimated_Shares_Outstanding is about 577 million while the median is about 310 million (right skewed, with outliers)
  • The average P/E and P/B ratios are 32.61 and -1.72, while the median values are 20.82 and -1.07 respectively. P/E is the ratio of a company's stock price to its earnings per share, and P/B is the ratio of its stock price to its book value (total assets minus total liabilities) per share. The negative average P/B indicates that some companies are in serious financial distress, with liabilities exceeding assets (left skewed)
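The skewness claims above can be quantified with pandas' built-in `skew()`. A minimal sketch on synthetic stand-in series (the real check would call `df[col].skew()` on the actual columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic stand-ins: a right-skewed (lognormal) series and a symmetric one
right_skewed = pd.Series(rng.lognormal(mean=4, sigma=1, size=340))
symmetric = pd.Series(rng.normal(loc=4, scale=12, size=340))

# positive skew -> long right tail (mean pulled above the median); near zero -> symmetric
print(f"lognormal skew: {right_skewed.skew():.2f}")
print(f"normal skew:    {symmetric.skew():.2f}")
assert right_skewed.mean() > right_skewed.median()
```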

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What does the distribution of stock prices look like?
  2. The stocks of which economic sector have seen the maximum price increase on average?
  3. How are the different variables correlated with each other?
  4. Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
  5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?
In [21]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [22]:
# selecting numerical columns
num_cols = df.select_dtypes(include=np.number).columns.tolist()
In [23]:
histogram_boxplot(df, num_cols[0], bins=50, kde=True, figsize=(10, 5))
  • Current_Price is right skewed with several positive outliers
In [24]:
histogram_boxplot(df, num_cols[1], bins=50, kde=True, figsize=(10, 5))
  • Price_Change has a somewhat normal distribution with some negative and positive outliers
In [25]:
histogram_boxplot(df, num_cols[2], bins=50, kde=True, figsize=(10, 5))
  • Volatility is right skewed with some positive outliers
In [26]:
histogram_boxplot(df, num_cols[3], bins=50, kde=True, figsize=(10, 5))
  • ROE is right skewed with several positive outliers
In [27]:
histogram_boxplot(df, num_cols[4], bins=50, kde=True, figsize=(10, 5))
  • Cash_Ratio is right skewed with some positive outliers
In [28]:
histogram_boxplot(df, num_cols[5], bins=50, kde=True, figsize=(10, 5))
  • Net_Cash_Flow has a normal distribution with several positive and negative outliers
In [29]:
histogram_boxplot(df, num_cols[6], bins=50, kde=True, figsize=(10, 5))
  • Net_Income has a normal distribution with some positive and a few negative outliers
In [30]:
histogram_boxplot(df, num_cols[7], bins=50, kde=True, figsize=(10, 5))
  • Earnings_Per_Share has a normal distribution with some positive and negative outliers
In [31]:
histogram_boxplot(df, num_cols[8], bins=50, kde=True, figsize=(10, 5))
  • Estimated_Shares_Outstanding is right skewed with several positive outliers
In [32]:
histogram_boxplot(df, num_cols[9], bins=50, kde=True, figsize=(10, 5))
  • P/E_Ratio is right skewed with some positive outliers
In [33]:
histogram_boxplot(df, num_cols[10], bins=50, kde=True, figsize=(10, 5))
  • P/B_Ratio has a normal distribution with a few positive and negative outliers
In [34]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [35]:
labeled_barplot(df, "GICS_Sector", perc=True)
  • The majority of the securities belong to the "Industrials" GICS Sector, and the fewest to "Telecommunications Services"
In [36]:
pd.crosstab(df.GICS_Sub_Industry, df.GICS_Sector).style.highlight_max(
    color="lightgreen", axis=0
)
Out[36]:
GICS_Sector Consumer Discretionary Consumer Staples Energy Financials Health Care Industrials Information Technology Materials Real Estate Telecommunications Services Utilities
GICS_Sub_Industry                      
Advertising 2 0 0 0 0 0 0 0 0 0 0
Aerospace & Defense 0 0 0 0 0 4 0 0 0 0 0
Agricultural Products 0 1 0 0 0 0 0 0 0 0 0
Air Freight & Logistics 0 0 0 0 0 3 0 0 0 0 0
Airlines 0 0 0 0 0 5 0 0 0 0 0
Alternative Carriers 0 0 0 0 0 0 0 0 0 1 0
Apparel, Accessories & Luxury Goods 1 0 0 0 0 0 0 0 0 0 0
Application Software 0 0 0 0 0 0 2 0 0 0 0
Asset Management & Custody Banks 0 0 0 4 0 0 0 0 0 0 0
Auto Parts & Equipment 2 0 0 0 0 0 0 0 0 0 0
Automobile Manufacturers 2 0 0 0 0 0 0 0 0 0 0
Banks 0 0 0 10 0 0 0 0 0 0 0
Biotechnology 0 0 0 0 7 0 0 0 0 0 0
Brewers 0 1 0 0 0 0 0 0 0 0 0
Broadcasting & Cable TV 2 0 0 0 0 0 0 0 0 0 0
Building Products 0 0 0 0 0 4 0 0 0 0 0
Cable & Satellite 3 0 0 0 0 0 0 0 0 0 0
Casinos & Gaming 1 0 0 0 0 0 0 0 0 0 0
Computer Hardware 0 0 0 0 0 0 1 0 0 0 0
Construction & Farm Machinery & Heavy Trucks 0 0 0 0 0 3 0 0 0 0 0
Construction Materials 0 0 0 0 0 0 0 2 0 0 0
Consumer Electronics 1 0 0 0 0 0 0 0 0 0 0
Consumer Finance 0 0 0 5 0 0 0 0 0 0 0
Copper 0 0 0 0 0 0 0 1 0 0 0
Data Processing & Outsourced Services 0 0 0 0 0 0 2 0 0 0 0
Distributors 1 0 0 0 0 0 0 0 0 0 0
Diversified Chemicals 0 0 0 0 0 0 0 5 0 0 0
Diversified Commercial Services 0 0 0 0 0 1 0 0 0 0 0
Diversified Financial Services 0 0 0 7 0 0 0 0 0 0 0
Drug Retail 0 1 0 0 0 0 0 0 0 0 0
Electric Utilities 0 0 0 0 0 0 0 0 0 0 12
Electrical Components & Equipment 0 0 0 0 0 1 0 0 0 0 0
Electronic Components 0 0 0 0 0 0 2 0 0 0 0
Electronic Equipment & Instruments 0 0 0 0 0 0 1 0 0 0 0
Environmental Services 0 0 0 0 0 1 0 0 0 0 0
Fertilizers & Agricultural Chemicals 0 0 0 0 0 0 0 2 0 0 0
Financial Exchanges & Data 0 0 0 1 0 0 0 0 0 0 0
Gold 0 0 0 0 0 0 0 1 0 0 0
Health Care Distributors 0 0 0 0 3 0 0 0 0 0 0
Health Care Equipment 0 0 0 0 11 0 0 0 0 0 0
Health Care Facilities 0 0 0 0 5 0 0 0 0 0 0
Health Care Supplies 0 0 0 0 2 0 0 0 0 0 0
Home Entertainment Software 0 0 0 0 0 0 1 0 0 0 0
Home Furnishings 1 0 0 0 0 0 0 0 0 0 0
Homebuilding 2 0 0 0 0 0 0 0 0 0 0
Hotels, Resorts & Cruise Lines 4 0 0 0 0 0 0 0 0 0 0
Household Appliances 1 0 0 0 0 0 0 0 0 0 0
Household Products 0 3 0 0 0 0 0 0 0 0 0
Housewares & Specialties 1 0 0 0 0 0 0 0 0 0 0
Human Resource & Employment Services 0 0 0 0 0 1 0 0 0 0 0
IT Consulting & Other Services 0 0 0 0 0 0 3 0 0 0 0
Industrial Conglomerates 0 0 0 0 0 14 0 0 0 0 0
Industrial Gases 0 0 0 0 0 0 0 1 0 0 0
Industrial Machinery 0 0 0 0 0 5 0 0 0 0 0
Industrial Materials 0 0 0 0 0 1 0 0 0 0 0
Insurance Brokers 0 0 0 3 0 0 0 0 0 0 0
Integrated Oil & Gas 0 0 5 0 0 0 0 0 0 0 0
Integrated Telecommunications Services 0 0 0 0 0 0 0 0 0 4 0
Internet & Direct Marketing Retail 4 0 0 0 0 0 0 0 0 0 0
Internet Software & Services 0 0 0 0 0 0 12 0 0 0 0
Investment Banking & Brokerage 0 0 0 2 0 0 0 0 0 0 0
Leisure Products 2 0 0 0 0 0 0 0 0 0 0
Life & Health Insurance 0 0 0 3 0 0 0 0 0 0 0
Life Sciences Tools & Services 0 0 0 0 1 0 0 0 0 0 0
Managed Health Care 0 0 0 0 5 0 0 0 0 0 0
Metal & Glass Containers 0 0 0 0 0 0 0 1 0 0 0
Motorcycle Manufacturers 1 0 0 0 0 0 0 0 0 0 0
Multi-Sector Holdings 0 0 0 1 0 0 0 0 0 0 0
Multi-line Insurance 0 0 0 1 0 0 0 0 0 0 0
MultiUtilities 0 0 0 0 0 0 0 0 0 0 11
Networking Equipment 0 0 0 0 0 0 1 0 0 0 0
Office REITs 0 0 0 0 0 0 0 0 1 0 0
Oil & Gas Equipment & Services 0 0 3 0 0 0 0 0 0 0 0
Oil & Gas Exploration & Production 0 0 16 0 0 0 0 0 0 0 0
Oil & Gas Refining & Marketing & Transportation 0 0 6 0 0 0 0 0 0 0 0
Packaged Foods & Meats 0 6 0 0 0 0 0 0 0 0 0
Paper Packaging 0 0 0 0 0 0 0 2 0 0 0
Personal Products 0 1 0 0 0 0 0 0 0 0 0
Pharmaceuticals 0 0 0 0 6 0 0 0 0 0 0
Property & Casualty Insurance 0 0 0 8 0 0 0 0 0 0 0
Publishing 1 0 0 0 0 0 0 0 0 0 0
REITs 0 0 0 0 0 0 0 0 14 0 0
Railroads 0 0 0 0 0 4 0 0 0 0 0
Real Estate Services 0 0 0 0 0 0 0 0 1 0 0
Regional Banks 0 0 0 3 0 0 0 0 0 0 0
Research & Consulting Services 0 0 0 0 0 4 0 0 0 0 0
Residential REITs 0 0 0 0 0 0 0 0 4 0 0
Restaurants 3 0 0 0 0 0 0 0 0 0 0
Retail REITs 0 0 0 0 0 0 0 0 4 0 0
Semiconductor Equipment 0 0 0 0 0 0 1 0 0 0 0
Semiconductors 0 0 0 0 0 0 6 0 0 0 0
Soft Drinks 0 4 0 0 0 0 0 0 0 0 0
Specialized REITs 0 0 0 0 0 0 0 0 3 0 0
Specialty Chemicals 0 0 0 0 0 0 0 4 0 0 0
Specialty Retail 1 0 0 0 0 0 0 0 0 0 0
Specialty Stores 3 0 0 0 0 0 0 0 0 0 0
Steel 0 0 0 0 0 0 0 1 0 0 0
Technology Hardware, Storage & Peripherals 0 0 0 0 0 0 1 0 0 0 0
Technology, Hardware, Software and Supplies 0 0 0 0 0 1 0 0 0 0 0
Thrifts & Mortgage Finance 0 0 0 1 0 0 0 0 0 0 0
Tires & Rubber 1 0 0 0 0 0 0 0 0 0 0
Tobacco 0 2 0 0 0 0 0 0 0 0 0
Trucking 0 0 0 0 0 1 0 0 0 0 0
Water Utilities 0 0 0 0 0 0 0 0 0 0 1
  • The most common GICS_Sub_Industry is "Oil & Gas Exploration & Production" (16 securities), which falls under the "Energy" GICS_Sector
In [37]:
# Bivariate analysis
# Stock Price_Change Vs. GICS_Sector

plt.figure(figsize=(20,6))
sns.boxplot(data = df, y = "Price_Change", x = "GICS_Sector");
  • Real Estate shows the least variation in Price_Change across its securities, while the Energy GICS_Sector shows the most
  • Health Care and Information Technology have the most securities with a high positive Price_Change, making them more favorable
In [38]:
# Cash_Ratio Vs. GICS_Sector

plt.figure(figsize=(20,8))
sns.boxplot(data = df, y = "Cash_Ratio", x = "GICS_Sector");
  • Real Estate and Financials have the lowest Cash_Ratio variance across the securities in their respective GICS_Sectors, while Information Technology and Health Care have the highest
  • Information Technology and Financials also have high median Cash_Ratios compared to other GICS_Sectors; Information Technology has some of the highest positive outliers for Cash_Ratio
In [39]:
# P/E_Ratio Vs. GICS_Sector

plt.figure(figsize=(20,8))
sns.boxplot(data = df, y = "P/E_Ratio", x = "GICS_Sector");
  • The Energy GICS_Sector has some of the highest variance in P/E ratios, as well as some securities with very high P/E ratios. This indicates investors are willing to pay more per dollar of earnings for a share in this sector than for securities in other GICS_Sectors
In [40]:
# check for correlations

plt.figure(figsize=(15, 7))
sns.heatmap(
    df[num_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
  • Price_Change has a negative correlation with Volatility (the standard deviation of the stock price, which captures big swings in either direction over the period). The more volatile a stock, the more negative its Price_Change tends to be
  • Earnings_Per_Share has a positive correlation with Current_Price and Net_Income
  • Estimated_Shares_Outstanding has a positive correlation with Net_Income
  • Earnings_Per_Share has a negative correlation with ROE and Volatility
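Such pairs can also be pulled out programmatically by ranking the absolute off-diagonal correlations. A sketch on synthetic columns (the column names and relationships here are fabricated for illustration; the real call would use `df[num_cols]`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 340
income = rng.normal(size=n)
frame = pd.DataFrame({
    "Net_Income": income,
    "Earnings_Per_Share": 0.8 * income + 0.2 * rng.normal(size=n),  # built to correlate
    "Volatility": rng.normal(size=n),                               # independent
})

corr = frame.corr()
# keep only the upper triangle (each pair once), then rank by |correlation|
pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
print(pairs.abs().sort_values(ascending=False).head(3))
```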
In [41]:
# Pair-plot analysis

sns.pairplot(df[num_cols],diag_kind="kde")
Out[41]:
<seaborn.axisgrid.PairGrid at 0x7b3af4c25ba0>
  • P/E_Ratio is bimodal and Volatility is trimodal; most of the other distributions are roughly normal

Data Preprocessing¶

  • Duplicate value check
  • Missing value treatment
  • Outlier check
  • Feature engineering (if needed)
  • Any other preprocessing steps (if needed)
In [42]:
# let's check for duplicate observations
df.duplicated().sum()
Out[42]:
0
  • There are no duplicate records in the dataset
In [43]:
# let's check for missing values
df.isnull().sum().sum()
Out[43]:
0
  • There are no null or missing values in the dataset
In [44]:
# Scaling the data to bring it to the same scale

sc = StandardScaler()
subset_scaled_df = pd.DataFrame(
    sc.fit_transform(df.drop(["Security", "GICS_Sector", "GICS_Sub_Industry"], axis=1)),
    columns=df.drop(["Security", "GICS_Sector", "GICS_Sub_Industry"], axis=1).columns,
)
subset_scaled_df.head()
Out[44]:
Current_Price Price_Change Volatility ROE Cash_Ratio Net_Cash_Flow Net_Income Earnings_Per_Share Estimated_Shares_Outstanding P/E_Ratio P/B_Ratio
0 -0.393341 0.493950 0.272749 0.989601 -0.210698 -0.339355 1.554415 1.309399 0.107863 -0.652487 -0.506653
1 -0.220837 0.355439 1.137045 0.937737 0.077269 -0.002335 0.927628 0.056755 1.250274 -0.311769 -0.504205
2 -0.367195 0.602479 -0.427007 -0.192905 -0.033488 0.454058 0.744371 0.024831 1.098021 -0.391502 0.094941
3 0.133567 0.825696 -0.284802 -0.317379 1.218059 -0.152497 -0.219816 -0.230563 -0.091622 0.947148 0.424333
4 -0.260874 -0.492636 0.296470 -0.265515 2.237018 0.133564 -0.202703 -0.374982 1.978399 3.293307 0.199196

EDA¶

  • It is a good idea to explore the data once again after manipulating it.
In [45]:
# Pair-plot analysis

sns.pairplot(subset_scaled_df ,diag_kind="kde");
  • After scaling (z-score standardization), the relationships between the attributes have been preserved; the attributes are now all on the same scale, with a mean of 0 and a standard deviation of 1
  • The data has no missing values or duplicate entries. Outliers have been identified but not treated; for this modeling context they are assumed to be genuine data points rather than anomalies
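A quick sanity check of the standardization claim (each scaled column should have mean ≈ 0 and standard deviation ≈ 1), sketched on synthetic data since it only exercises the scaler:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# synthetic skewed columns standing in for the numeric features
raw = pd.DataFrame({"price": rng.lognormal(4, 1, 340),
                    "roe": rng.gamma(2, 20, 340)})

scaled = pd.DataFrame(StandardScaler().fit_transform(raw), columns=raw.columns)

# StandardScaler divides by the population std (ddof=0), so check against that
assert np.allclose(scaled.mean(), 0, atol=1e-9)
assert np.allclose(scaled.std(ddof=0), 1, atol=1e-9)
print(scaled.describe().loc[["mean", "std"]].round(3))
```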

K-means Clustering¶

In [46]:
clusters=range(1,9)
meanDistortions=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(subset_scaled_df)
    prediction=model.predict(subset_scaled_df)
    distortion=sum(np.min(cdist(subset_scaled_df, model.cluster_centers_, 'euclidean'), axis=1)) / subset_scaled_df.shape[0]

    meanDistortions.append(distortion)

    print('Number of Clusters:', k, '\tAverage Distortion:', distortion)

plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average Distortion')
plt.title('Selecting k with the Elbow Method', fontsize=20)
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.384499097487295
Number of Clusters: 3 	Average Distortion: 2.2683105560042285
Number of Clusters: 4 	Average Distortion: 2.179645269703779
Number of Clusters: 5 	Average Distortion: 2.126829442356109
Number of Clusters: 6 	Average Distortion: 2.0537507716340393
Number of Clusters: 7 	Average Distortion: 2.046245663604377
Number of Clusters: 8 	Average Distortion: 1.9893285845489175
Out[46]:
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
  • Appropriate K seems to be 4 or 5
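The visual elbow read-off can be cross-checked with a simple heuristic: look at how much distortion improves with each extra cluster and note where the gains flatten. Using the (rounded) distortion values printed above:

```python
# Average distortions for k = 1..8, rounded from the output above
distortions = [2.5425, 2.3845, 2.2683, 2.1796, 2.1268, 2.0538, 2.0462, 1.9893]

# improvement obtained by moving from k-1 clusters to k
gains = [distortions[i] - distortions[i + 1] for i in range(len(distortions) - 1)]
for k, g in enumerate(gains, start=2):
    print(f"k={k}: distortion drops by {g:.4f}")

# the early gains (k=2..4) are all larger than any later gain,
# consistent with an elbow around k=4 or 5
assert max(gains[3:]) < min(gains[:3])
```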
In [47]:
# checking silhouette score

sil_score = []
cluster_list = list(range(2,10))
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict((subset_scaled_df))
    #centers = clusterer.cluster_centers_
    score = silhouette_score(subset_scaled_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
For n_clusters = 2, silhouette score is 0.43969639509980457
For n_clusters = 3, silhouette score is 0.45797710447228496
For n_clusters = 4, silhouette score is 0.45434371948348606
For n_clusters = 5, silhouette score is 0.4328525384867999
For n_clusters = 6, silhouette score is 0.41439237473055257
For n_clusters = 7, silhouette score is 0.4265962625132064
For n_clusters = 8, silhouette score is 0.41923671063647955
For n_clusters = 9, silhouette score is 0.3678986004890057
In [48]:
plt.plot(cluster_list,sil_score)
plt.grid()
  • From the silhouette scores, 4 seems to be a good choice for the number of clusters
In [49]:
# Finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state = 1))
visualizer.fit(subset_scaled_df)
visualizer.show();
In [50]:
visualizer = SilhouetteVisualizer(KMeans(5, random_state = 1))
visualizer.fit(subset_scaled_df)
visualizer.show();
In [51]:
visualizer = SilhouetteVisualizer(KMeans(6, random_state = 1))
visualizer.fit(subset_scaled_df)
visualizer.show();
In [52]:
# Choosing k=4 clusters
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(subset_scaled_df)
Out[52]:
KMeans(n_clusters=4, random_state=0)
In [53]:
df['K_means_segments'] = kmeans.labels_
subset_scaled_df['K_means_segments'] = kmeans.labels_
In [54]:
cluster_profile = df.groupby('K_means_segments').mean()
In [55]:
cluster_profile['count_in_each_segments'] = df.groupby('K_means_segments')['Security'].count().values
In [56]:
cluster_profile
Out[56]:
Current_Price Price_Change Volatility ROE Cash_Ratio Net_Cash_Flow Net_Income Earnings_Per_Share Estimated_Shares_Outstanding P/E_Ratio P/B_Ratio count_in_each_segments
K_means_segments
0 234.170932 13.400685 1.729989 25.600000 277.640000 1.554927e+09 1.572612e+09 6.045200 5.783163e+08 74.960824 14.402452 25
1 38.099260 -15.370329 2.910500 107.074074 50.037037 -1.594285e+08 -3.887458e+09 -9.473704 4.803986e+08 90.619220 1.342067 27
2 50.517273 5.747586 1.130399 31.090909 75.909091 -1.072273e+09 1.483309e+10 4.154545 4.298827e+09 14.803577 -4.552119 11
3 72.399112 5.066225 1.388319 34.620939 53.000000 -1.404622e+07 1.482212e+09 3.621029 4.385338e+08 23.843656 -3.358948 277
In [57]:
fig, axes = plt.subplots(3, 4,  figsize=(20, 16))
fig.suptitle('Boxplot of numerical variables for each cluster', fontsize=20)
counter = 0
for ii in range(3):
    for jj in range(4):
        if counter <11:
            sns.boxplot(ax=axes[ii, jj],y=subset_scaled_df[num_cols[counter]],x=subset_scaled_df['K_means_segments'])
        counter = counter+1

fig.tight_layout(pad=2.0)

Insights¶

  • Cluster 0

    • Has about 25 different securities
    • This cluster has securities with very high average Current_Price of $234
    • It also has securities with a very high average Cash_Ratio of 277, a high Earnings_Per_Share of 6.05, and a high P/B_Ratio of 14.4
  • Cluster 1

    • Has about 27 different securities
    • This cluster has securities with among the lowest average Current_Price, $38, and a negative average Price_Change of -15.3
    • It also has securities with among the highest Volatility, 2.9, making them riskier
    • This cluster also has the lowest average Cash_Ratio of 50, a negative average Net_Income of about -3.89e+09, and a negative average Earnings_Per_Share of -9.47
  • Cluster 3

    • Has the majority of securities at 277
    • This cluster has securities that lie between Cluster 0 and Cluster 1: an intermediate average Current_Price of $72, an intermediate Price_Change of 5.07 (neither high and positive like cluster 0 nor large and negative like cluster 1), and intermediate volatility
  • Cluster 2

    • Has only 11 different securities
    • This cluster is very similar to Cluster 3 in its behavior, but with two striking differences:

      Has securities with some of the highest average Net_Income of 1.483309e+10 (almost 10 times of those in Cluster 3)

      Has securities with some of the highest Estimated_Shares_Outstanding of 4.298827e+09 (almost 10 times of those in Cluster 3)

Clusters 3 and 2 are the safer clusters, with cluster 2 containing more exclusive securities. Clusters 0 and 1 hold riskier securities, the former high performing and the latter historically low performing.

In [58]:
# Comparing cluster vs. GICS_Sector

pd.crosstab(df.GICS_Sector, df.K_means_segments).style.highlight_max(color = 'lightgreen', axis = 0)
Out[58]:
K_means_segments 0 1 2 3
GICS_Sector        
Consumer Discretionary 6 0 1 33
Consumer Staples 1 0 1 17
Energy 1 22 1 6
Financials 1 0 3 45
Health Care 9 0 2 29
Industrials 0 1 0 52
Information Technology 5 3 1 24
Materials 0 1 0 19
Real Estate 1 0 0 26
Telecommunications Services 1 0 2 2
Utilities 0 0 0 24
  • Among the securities in Cluster 0, the majority are Health Care, followed by Consumer Discretionary & Information Technology
  • Cluster 1 is dominated by securities belonging to Energy division
  • Cluster 2 is predominantly Financials
  • Cluster 3 is diversified, with securities predominantly belonging to Industrials followed by Financials, Consumer Discretionary, Real Estate & Information Technology
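Because the clusters have very different sizes, the sector composition is easier to compare when the crosstab is normalized by row. A minimal sketch on toy labels (the `sector`/`cluster` columns here are illustrative stand-ins, not the original data):

```python
import pandas as pd

# Hypothetical sector and cluster assignments
toy = pd.DataFrame({
    "sector":  ["Energy", "Energy", "Energy", "Financials", "Financials", "Health Care"],
    "cluster": [1, 1, 3, 3, 3, 0],
})

# normalize="index" converts counts to within-sector proportions,
# so every row sums to 1 regardless of sector size
prop = pd.crosstab(toy["sector"], toy["cluster"], normalize="index")
```

Applied to `df.GICS_Sector` and `df.K_means_segments`, the same call would show, e.g., what fraction of Energy securities landed in each cluster rather than raw counts.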

Hierarchical Clustering¶

In [59]:
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(subset_scaled_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9304469769832865
Cophenetic correlation for Euclidean distance and complete linkage is 0.8559480642212798
Cophenetic correlation for Euclidean distance and average linkage is 0.946403836884538
Cophenetic correlation for Euclidean distance and weighted linkage is 0.7508819056084053
Cophenetic correlation for Chebyshev distance and single linkage is 0.9161627445317929
Cophenetic correlation for Chebyshev distance and complete linkage is 0.822502094153258
Cophenetic correlation for Chebyshev distance and average linkage is 0.9379218754329659
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9153206618543516
Cophenetic correlation for Mahalanobis distance and single linkage is 0.9348505176633238
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.6881861661402056
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9360657692078034
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8810701545336993
Cophenetic correlation for Cityblock distance and single linkage is 0.938373245895409
Cophenetic correlation for Cityblock distance and complete linkage is 0.8124007660644492
Cophenetic correlation for Cityblock distance and average linkage is 0.9168123859372297
Cophenetic correlation for Cityblock distance and weighted linkage is 0.866729262879581
In [60]:
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.946403836884538, which is obtained with Euclidean distance and average linkage
In [61]:
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for lm in linkage_methods:
    Z = linkage(subset_scaled_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
    print("Cophenetic correlation for {} linkage is {}".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm
Cophenetic correlation for single linkage is 0.9304469769832865
Cophenetic correlation for complete linkage is 0.8559480642212798
Cophenetic correlation for average linkage is 0.946403836884538
Cophenetic correlation for centroid linkage is 0.9494262703881242
Cophenetic correlation for ward linkage is 0.7436374975239648
Cophenetic correlation for weighted linkage is 0.7508819056084053
In [62]:
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage".format(
        high_cophenet_corr, high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.9494262703881242, which is obtained with centroid linkage
In [63]:
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(subset_scaled_df, metric="euclidean", method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
  • Cophenetic correlation is highest with Euclidean distance & average or centroid linkages
  • 6 appears to be the appropriate number of clusters from the dendrogram for average linkage
In [64]:
HCmodel = AgglomerativeClustering(n_clusters=6, affinity="euclidean", linkage="average")
HCmodel.fit(subset_scaled_df)
Out[64]:
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=6)
In [65]:
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
In [66]:
cluster_profile = df.groupby("HC_Clusters").mean()
In [67]:
cluster_profile["count_in_each_segments"] = (
    df.groupby("HC_Clusters")["Security"].count().values
)
In [68]:
cluster_profile
Out[68]:
Current_Price Price_Change Volatility ROE Cash_Ratio Net_Cash_Flow Net_Income Earnings_Per_Share Estimated_Shares_Outstanding P/E_Ratio P/B_Ratio K_means_segments count_in_each_segments
HC_Clusters
0 77.287589 4.099730 1.518066 35.336336 66.900901 -3.319732e+07 1.538075e+09 2.88527 5.605050e+08 32.441706 -2.174921 2.630631 333
1 25.640000 11.237908 1.322355 12.500000 130.500000 1.675550e+10 1.365400e+10 3.29500 2.791829e+09 13.649696 1.508484 1.000000 2
2 24.485001 -13.351992 3.482611 802.000000 51.000000 -1.292500e+09 -1.910650e+10 -41.81500 5.195740e+08 60.748608 1.565141 1.000000 2
3 104.660004 16.224320 1.320606 8.000000 958.000000 5.920000e+08 3.669000e+09 1.31000 2.800763e+09 79.893133 5.884467 0.000000 1
4 1274.949951 3.190527 1.268340 29.000000 184.000000 -1.671386e+09 2.551360e+09 50.09000 5.093552e+07 25.453183 -1.052429 0.000000 1
5 276.570007 6.189286 1.116976 30.000000 25.000000 9.088500e+07 5.965410e+08 8.91000 6.695185e+07 31.040405 129.064585 0.000000 1
  • Apart from cluster 0 (333 securities), the remaining 5 clusters contain only 2, 2, 1, 1, and 1 securities. This clustering is not useful, as nearly all securities fall into a single cluster
  • Re-trying with Ward linkage, which produced more distinct and separated clusters (as seen in its dendrogram above). 4 appears to be the appropriate number of clusters from the dendrogram for Ward linkage
In [69]:
HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="ward")
HCmodel.fit(subset_scaled_df)
Out[69]:
AgglomerativeClustering(affinity='euclidean', n_clusters=4)
In [70]:
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
In [71]:
cluster_profile = df.groupby("HC_Clusters").mean()
In [72]:
cluster_profile["count_in_each_segments"] = (
    df.groupby("HC_Clusters")["Security"].count().values
)
In [73]:
cluster_profile
Out[73]:
Current_Price Price_Change Volatility ROE Cash_Ratio Net_Cash_Flow Net_Income Earnings_Per_Share Estimated_Shares_Outstanding P/E_Ratio P/B_Ratio K_means_segments count_in_each_segments
HC_Clusters
0 46.558126 -11.798670 2.617878 178.750000 50.250000 4.349716e+07 -3.197472e+09 -7.785312 4.732895e+08 72.496532 -0.780467 1.281250 32
1 71.846974 4.953643 1.392784 25.117216 53.831502 1.197788e+06 1.557674e+09 3.691044 4.439183e+08 23.583804 -3.087957 2.996337 273
2 229.579357 14.049986 1.735216 25.423077 268.423077 1.712688e+09 1.981882e+09 5.946923 7.219242e+08 84.216911 13.114240 0.115385 26
3 46.672222 5.166566 1.079367 25.000000 58.333333 -3.040667e+09 1.484844e+10 3.435556 4.564960e+09 15.596051 -6.354193 2.000000 9
  • Now, the clusters appear to have more variability
In [74]:
# let's see the names of the securities in each cluster
for cl in df["HC_Clusters"].unique():
    print(
        "The",
        df[df["HC_Clusters"] == cl]["Security"].nunique(),
        "Securities in cluster",
        cl,
        "are:",
    )
    print(df[df["HC_Clusters"] == cl]["Security"].unique())
    print("-" * 100, "\n")
The 273 Securities in cluster 1 are:
['American Airlines Group', 'AbbVie', 'Abbott Laboratories', 'Adobe Systems Inc', 'Archer-Daniels-Midland Co', ..., 'Xylem Inc.', 'Yum! Brands Inc', 'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis']
Length: 273
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
---------------------------------------------------------------------------------------------------- 

The 26 Securities in cluster 2 are:
['Analog Devices, Inc.', 'Alliance Data Systems', 'Alexion Pharmaceuticals', 'Amgen Inc', 'Amazon.com Inc', ..., 'TripAdvisor', 'Vertex Pharmaceuticals Inc', 'Waters Corporation', 'Wynn Resorts Ltd', 'Yahoo Inc.']
Length: 26
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
---------------------------------------------------------------------------------------------------- 

The 32 Securities in cluster 0 are:
['Allegion', 'Apache Corporation', 'Anadarko Petroleum Corp', 'Baker Hughes Inc', 'Chesapeake Energy', ..., 'S&P Global, Inc.', 'Southwestern Energy', 'Teradata Corp.', 'Williams Cos.', 'Cimarex Energy']
Length: 32
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
---------------------------------------------------------------------------------------------------- 

The 9 Securities in cluster 3 are:
['Citigroup Inc.', 'Ford Motor', 'JPMorgan Chase & Co.', 'Coca Cola Company', 'Pfizer Inc.', 'AT&T Inc', 'Verizon Communications', 'Wells Fargo', 'Exxon Mobil Corp.']
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
---------------------------------------------------------------------------------------------------- 

In [75]:
fig, axes = plt.subplots(3, 4,  figsize=(20, 16))
fig.suptitle('Boxplot of numerical variables for each cluster', fontsize=20)
counter = 0
for ii in range(3):
    for jj in range(4):
        if counter <11:
            sns.boxplot(ax=axes[ii, jj],y=subset_scaled_df[num_cols[counter]],x=subset_scaled_df['HC_Clusters'])
        counter = counter+1

fig.tight_layout(pad=2.0)

Insights¶

Apart from minor differences here and there, the groupings obtained with hierarchical clustering using Euclidean distance & Ward linkage are similar to those obtained with K-Means clustering!

  • Cluster 0 of Hierarchical clustering

    • Similar to cluster 1 of K-Means clustering
  • Cluster 1 of Hierarchical clustering

    • Similar to cluster 3 of K-Means clustering
  • Cluster 2 of Hierarchical clustering

    • Similar to cluster 0 of K-Means clustering
  • Cluster 3 of Hierarchical clustering

    • Similar to cluster 2 of K-Means clustering
In [76]:
# Comparing cluster vs. GICS_Sector

pd.crosstab(df.GICS_Sector, df.HC_Clusters).style.highlight_max(color = 'lightgreen', axis = 0)
Out[76]:
HC_Clusters 0 1 2 3
GICS_Sector        
Consumer Discretionary 1 32 6 1
Consumer Staples 2 15 1 1
Energy 23 6 0 1
Financials 1 44 1 3
Health Care 0 30 9 1
Industrials 2 51 0 0
Information Technology 2 24 7 0
Materials 1 19 0 0
Real Estate 0 26 1 0
Telecommunications Services 0 2 1 2
Utilities 0 24 0 0
  • The results are quite similar to those observed with K-Means clustering

K-means vs Hierarchical Clustering¶

We compare the two techniques on several counts:

  • Which clustering technique took less time to execute?
  • Which technique gave more distinct clusters, or are they the same?
  • How many observations fall in the corresponding clusters of both algorithms?
  • How many clusters were chosen as the appropriate number by each algorithm?

We also note the differences and similarities between the cluster profiles obtained from the two techniques.
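Besides a crosstab of the two label columns, the agreement between the two partitions can be summarized in a single number with the adjusted Rand index, which is invariant to how the cluster labels are numbered. A sketch with hypothetical label arrays (not the notebook's actual labels):

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster assignments from two algorithms;
# the partitions are identical up to a relabeling (0 <-> 1)
kmeans_labels = [0, 0, 1, 1, 2, 2, 2, 3]
hc_labels     = [1, 1, 0, 0, 2, 2, 2, 3]

ari = adjusted_rand_score(kmeans_labels, hc_labels)
# ARI is 1.0 for identical partitions (up to relabeling), near 0 for random ones
```

In this notebook the call would be `adjusted_rand_score(df["K_means_segments"], df["HC_Clusters"])`, complementing the crosstab comparison below.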

In [77]:
import time


# Initialize K-means clustering with the 4 clusters
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters)

# Measure the execution time
start_time = time.time()
kmeans.fit(subset_scaled_df)
end_time = time.time()

# Calculate the execution time
execution_time = end_time - start_time

print(f"K-means clustering took {execution_time:.4f} seconds")
K-means clustering took 0.0859 seconds
In [78]:
# Measure the execution time for hierarchical clustering
start_time = time.time()

# Perform hierarchical clustering
HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="ward")
HCmodel.fit(subset_scaled_df)

# End timing
end_time = time.time()

# Calculate the execution time
execution_time = end_time - start_time

print(f"Hierarchical clustering took {execution_time:.4f} seconds")
Hierarchical clustering took 0.0110 seconds
  • Hierarchical clustering took less time to execute in this run (single-run timings can vary between executions)
In [79]:
# Comparing Hierarchical cluster vs. K-means  cluster

pd.crosstab(df.K_means_segments, df.HC_Clusters).style.highlight_max(color = 'lightgreen', axis = 0)
Out[79]:
HC_Clusters 0 1 2 3
K_means_segments        
0 1 0 24 0
1 26 0 1 0
2 0 1 1 9
3 5 272 0 0
  • The 4 clusters identified by K-Means and Hierarchical Clustering group (26+272+24+9 =) 331 of the 340 securities identically. PCA has been performed to visualize the clusters from the two algorithms in 2-dimensional space

Dimensionality Reduction using PCA for visualization¶

In [80]:
# PCA to reduce the data to two dimensions and visualize it to see how well-separated the clusters are

# setting the number of components to 2
pca = PCA(n_components=2)

# transforming data and storing results in a dataframe
X_reduced_pca = pca.fit_transform(subset_scaled_df)
reduced_df_pca = pd.DataFrame(
    data=X_reduced_pca, columns=["Component 1", "Component 2"]
)
In [81]:
# checking the amount of variance explained
pca.explained_variance_ratio_.sum()
Out[81]:
0.38933325456234846
  • The first two principal components explain 38.9% of the variance in the data
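The 38.9% figure is the sum of `explained_variance_ratio_` over the two retained components. A sketch of checking how the explained variance accumulates across components, on synthetic data standing in for the 11 scaled features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 11-feature scaled data
X, _ = make_blobs(n_samples=200, n_features=11, centers=4, random_state=1)

pca = PCA().fit(X)  # keep all components
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 90% of the variance
n_90 = int(np.argmax(cum_var >= 0.90)) + 1
```

Here the goal is only 2-D visualization, so two components suffice even though they leave most of the variance unexplained; the same cumulative curve would guide the choice if PCA were used for modeling instead.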

Scatterplot by cluster labels - K-Means Clustering¶

In [82]:
sns.scatterplot(
    data=reduced_df_pca,
    x="Component 1",
    y="Component 2",
    hue=df["K_means_segments"],
    palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
Out[82]:
<matplotlib.legend.Legend at 0x7b3af6a843d0>

Scatterplot by cluster labels - HierarchicalClustering¶

In [83]:
sns.scatterplot(
    data=reduced_df_pca,
    x="Component 1",
    y="Component 2",
    hue=df["HC_Clusters"],
    palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
Out[83]:
<matplotlib.legend.Legend at 0x7b3af70f4c70>
  • The 4 clusters are well separated, with one cluster containing the majority of data points and the other clusters capturing the outliers
  • While there are subtle differences between the plots obtained with K-Means and hierarchical clustering, broadly there are more similarities than dissimilarities in the obtained clusters

Actionable Insights and Recommendations¶


Insights

Exploratory Data Analysis

  • Current_Price and Estimated_Shares_Outstanding are right skewed across securities in all sectors (with several positive outliers)

  • Health Care and Financial sectors have seen some of the highest positive Price_Change in the last 13 weeks, making them favorable to investors

  • Information Technology and Financials have some of the highest Cash_Ratios, making them more favorable than other sectors

  • The Real Estate sector shows the least variation in Price_Change and in Cash_Ratio across its securities, making it a safer investment choice

  • The Energy sector has some of the highest variance in Price_Change across its securities, making it more volatile and riskier for investors. However, this sector has securities with high P/E_Ratios, indicating investors are willing to pay more per dollar of earnings for a share of an Energy company than for securities in other sectors

Clustering Profiles

  • Out of 340 securities in our data set, both clustering methods clustered 331 in a similar fashion, with 9 securities clustered differently. The sector segregation into clusters yielded similar results across both algorithms, and PCA visualization for both techniques showed similar cluster densities in 2D space

  • One cluster (25+ securities) was identified as very aggressive (& high performing), belonging predominantly to the Health Care sector, followed by Consumer Discretionary and Information Technology

  • Another cluster (25+ securities) was identified as very aggressive (but historically low performing), belonging predominantly to the Energy sector

  • Another cluster (~10 securities) was identified as moderately aggressive (& high performing), belonging predominantly to the Financials sector

  • Finally, a major cluster (270+ securities) was identified as a mildly aggressive & safe investment option. This cluster is diversified, with securities predominantly belonging to Industrials, followed by Financials, Consumer Discretionary, Real Estate, & Information Technology

Recommendations

  • Securities were segregated into 4 distinct clusters, identifying very aggressive (both high and low performing), moderately aggressive (& high performing), and mildly aggressive options. This is important for splitting stocks across diversified investments, enabling one to maximize earnings in any market condition

  • However, it is important to keep in mind that the stock market is often volatile, and past indicators may not always predict future trends. Dynamic clustering (as more data is added each day) and the movement of stocks across cluster groups under changing market conditions need further analysis for better predictions